Cardio Good Fitness

Objective

The dataset is explored to identify differences between the customers of each product and to find relationships between the different customer attributes. The features of the dataset are also examined to draw insights relevant to the business. Python is used for the entire analysis.

Data Description:

The data is about customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables-

  1. Product - The model no. of the treadmill
  2. Age - Age of the customer in no of years
  3. Gender - Gender of the customer
  4. Education - Education of the customer in no. of years
  5. Marital Status - Marital status of the customer
  6. Usage - Avg. # times the customer wants to use the treadmill every week
  7. Fitness - Self rated fitness score of the customer (5 - very fit, 1 - very unfit)
  8. Income - Income of the customer
  9. Miles - Miles that a customer expects to run

Understanding the structure of the data

Importing necessary libraries

In [335]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Importing the dataset

In [336]:
# Read the CSV file and store it in the Dataframe
cardio_fit_data= pd.read_csv('CardioGoodFitness.csv')

The first and last 5 rows of the dataset

In [337]:
cardio_fit_data.head()
Out[337]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
0 TM195 18 Male 14 Single 3 4 29562 112
1 TM195 19 Male 15 Single 2 3 31836 75
2 TM195 19 Female 14 Partnered 4 3 30699 66
3 TM195 19 Male 12 Single 3 3 32973 85
4 TM195 20 Male 13 Partnered 4 2 35247 47
In [338]:
cardio_fit_data.tail()
Out[338]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
175 TM798 40 Male 21 Single 6 5 83416 200
176 TM798 42 Male 18 Single 5 4 89641 200
177 TM798 45 Male 16 Single 5 5 90886 160
178 TM798 47 Male 18 Partnered 4 5 104581 120
179 TM798 48 Male 18 Partnered 4 5 95508 180

The shape of the dataset

In [339]:
# checking shape of the data
print("There are", cardio_fit_data.shape[0], 'rows and', cardio_fit_data.shape[1], "columns.")
There are 180 rows and 9 columns.

It is important to check whether the data contains any missing values or duplicate rows.

In [340]:
# checking missing values
cardio_fit_data.isnull().sum()
Out[340]:
Product          0
Age              0
Gender           0
Education        0
MaritalStatus    0
Usage            0
Fitness          0
Income           0
Miles            0
dtype: int64
In [341]:
# checking duplicate rows
cardio_fit_data.duplicated().sum()
Out[341]:
0
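
The two checks above can be tried on a small frame where the answer is known in advance. This is only a sketch: the toy DataFrame below is hypothetical (it is not taken from CardioGoodFitness.csv) and contains one missing Age and one repeated row on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, built only to illustrate the two checks
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM195"],
    "Age": [18, 19, np.nan, 18],
    "Miles": [112, 75, 66, 112],
})

missing_per_column = toy.isnull().sum()        # number of NaN values in each column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```

If duplicates were found in the real data, drop_duplicates() would be one way to remove them before the analysis.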

Check the data types of the columns for the dataset

In [342]:
cardio_fit_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product        180 non-null    object
 1   Age            180 non-null    int64 
 2   Gender         180 non-null    object
 3   Education      180 non-null    int64 
 4   MaritalStatus  180 non-null    object
 5   Usage          180 non-null    int64 
 6   Fitness        180 non-null    int64 
 7   Income         180 non-null    int64 
 8   Miles          180 non-null    int64 
dtypes: int64(6), object(3)
memory usage: 12.8+ KB
  • There are 6 numerical columns in the data and 3 object type columns.
  • There is no date column in this data, so we don't need to transform any column.
  • Also, the 6 integer columns look correct and we don't need to convert them to float or any other data type.

Let's check the count and percentage of categorical levels in each column

In [343]:
# Making a list of all categorical variables
cat_cols =  ['Product', 'Gender', 'MaritalStatus']

# Printing the count of unique categorical levels in each column
for column in cat_cols:
    print(cardio_fit_data[column].value_counts())
    print("-" * 50)
TM195    80
TM498    60
TM798    40
Name: Product, dtype: int64
--------------------------------------------------
Male      104
Female     76
Name: Gender, dtype: int64
--------------------------------------------------
Partnered    107
Single        73
Name: MaritalStatus, dtype: int64
--------------------------------------------------
In [344]:
# Printing the percentage of unique categorical levels in each column
for column in cat_cols:
    print(cardio_fit_data[column].value_counts(normalize=True))
    print("-" * 50)
TM195   0.444
TM498   0.333
TM798   0.222
Name: Product, dtype: float64
--------------------------------------------------
Male     0.578
Female   0.422
Name: Gender, dtype: float64
--------------------------------------------------
Partnered   0.594
Single      0.406
Name: MaritalStatus, dtype: float64
--------------------------------------------------

Observations

  • TM195 is purchased by more customers than either of the other two products.
  • There are more male customers than female customers.
  • Partnered customers buy treadmills more often than single customers, possibly because the treadmill can be shared within the household.
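
The shares printed by value_counts(normalize=True) are simply each count divided by the total number of rows. A minimal check using the Product counts shown above (the variable names here are ours, for illustration only):

```python
# Product counts as printed by value_counts() above
counts = {"TM195": 80, "TM498": 60, "TM798": 40}

total = sum(counts.values())  # 180 rows in the dataset

# Share of each product, rounded to 3 decimals like the notebook's float format
shares = {product: round(n / total, 3) for product, n in counts.items()}

for product, share in shares.items():
    print(product, share)
```

The rounded shares agree with the normalized output above.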

Checking the statistical summary of the data.

In [345]:
cardio_fit_data.describe(include='all').T
Out[345]:
count unique top freq mean std min 25% 50% 75% max
Product 180 3 TM195 80 NaN NaN NaN NaN NaN NaN NaN
Age 180.000 NaN NaN NaN 28.789 6.943 18.000 24.000 26.000 33.000 50.000
Gender 180 2 Male 104 NaN NaN NaN NaN NaN NaN NaN
Education 180.000 NaN NaN NaN 15.572 1.617 12.000 14.000 16.000 16.000 21.000
MaritalStatus 180 2 Partnered 107 NaN NaN NaN NaN NaN NaN NaN
Usage 180.000 NaN NaN NaN 3.456 1.085 2.000 3.000 3.000 4.000 7.000
Fitness 180.000 NaN NaN NaN 3.311 0.959 1.000 3.000 3.000 4.000 5.000
Income 180.000 NaN NaN NaN 53719.578 16506.684 29562.000 44058.750 50596.500 58668.000 104581.000
Miles 180.000 NaN NaN NaN 103.194 51.864 21.000 66.000 94.000 114.750 360.000
  • Age: On average customers are around 29 years old, with a minimum of 18 years and a maximum of 50 years.
  • Education: The minimum is 12 years with an average of about 15.6, so every customer has completed at least high school.
  • Usage: This is the average number of times the customer wants to use the treadmill every week. The 75th percentile (4) and the maximum (7) look reasonable.
  • Fitness: It's the self-rated fitness score with a minimum of 1 and a maximum of 5. This data also looks valid.
  • Income: The mean income is about 53.7k and 75% of the customers earn below 60k, but the maximum is 104581, so a few customers have much higher incomes.
  • Miles: As with income, 75% of the customers expect to run fewer than 115 miles, but some expect to run far more, up to 360.

Univariate Data Analysis

Let's check the distribution for numerical columns.

In [346]:
# Defining the function for creating boxplot and histogram
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize)  # creating the 2 subplots
    
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="mediumturquoise")  # boxplot; showmeans=True adds a marker at the mean value of the column

    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="mediumpurple")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="mediumpurple")  # For histogram
    
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")  # Add mean to the histogram
    
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # Add median to the histogram

Observations on Age

In [347]:
histogram_boxplot(cardio_fit_data,'Age')
  • The distribution is skewed to the right.
  • There are some outliers present in this column.

Observations on Education

In [348]:
sns.boxplot(data=cardio_fit_data,x='Education')
plt.show()
In [349]:
sns.displot(data=cardio_fit_data,x='Education',kind='kde')
plt.show()

Skewness

Skewness is a measure of asymmetry of a distribution.

· In a normal distribution, the mean divides the curve symmetrically into two equal parts at the median and the value of skewness is zero.

· When a distribution is asymmetrical, the tail of the distribution is skewed to one side, either to the right or to the left.

· When the value of the skewness is negative, the tail of the distribution is longer towards the left hand side of the curve.

· When the value of the skewness is positive, the tail of the distribution is longer towards the right hand side of the curve.

In [350]:
cardio_fit_data.skew(axis = 0, skipna = True,numeric_only=True)
Out[350]:
Age         0.982
Education   0.622
Usage       0.739
Fitness     0.455
Income      1.292
Miles       1.724
dtype: float64
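
To make the sign convention concrete, skewness can also be computed by hand from the second and third central moments. The sketch below is an illustration, not the notebook's method: sample_skewness is a hypothetical helper implementing the adjusted Fisher-Pearson coefficient, which to our understanding is the bias-corrected form that pandas reports, and the three lists are made-up samples.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # uncorrected skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample correction

# Hypothetical samples
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 20]        # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]     # mirror image, tail on the left

print(sample_skewness(symmetric))     # zero for a symmetric sample
print(sample_skewness(right_tailed))  # positive
print(sample_skewness(left_tailed))   # negative
```

All six columns above have positive skewness, so each has its longer tail on the right, with Miles and Income the most pronounced.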

Kurtosis

· Kurtosis is one of the two measures that quantify the shape of a distribution. It indicates how much of the data lies in the tails, i.e. how prone the distribution is to outliers.

· Kurtosis describes the peakedness of the distribution.

If the distribution is tall and thin it is called a leptokurtic distribution (Kurtosis > 3). Values in a leptokurtic distribution are near the mean or at the extremes.

A flat distribution where the values are moderately spread out (i.e., unlike leptokurtic) is called a platykurtic (Kurtosis < 3) distribution.

A distribution whose shape is in between a leptokurtic distribution and a platykurtic distribution is called a mesokurtic (Kurtosis = 3) distribution. A mesokurtic distribution looks close to a normal distribution.

· Kurtosis is sometimes reported as “excess kurtosis.” Excess kurtosis is determined by subtracting 3 from the kurtosis. This makes the kurtosis of the normal distribution equal to 0. The pandas kurt() function used below reports excess kurtosis, so its values should be compared with 0 rather than 3.

In [351]:
cardio_fit_data.kurt(axis = 0, skipna = True,numeric_only=True)
Out[351]:
Age          0.410
Education    1.033
Usage        0.543
Fitness     -0.369
Income       1.374
Miles        4.321
dtype: float64
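
The relationship between kurtosis and excess kurtosis can be sketched with a few lines of plain Python. This is a simplified illustration: the kurtosis helper below is hypothetical and uses the plain moment ratio m4 / m2², without the small-sample correction that pandas applies, so its numbers will not match kurt() exactly. The samples are made up.

```python
def kurtosis(values, excess=True):
    """Moment-based kurtosis m4 / m2**2, optionally minus 3 (excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    k = m4 / m2 ** 2
    return k - 3 if excess else k

# Hypothetical samples
flat = [1, 2, 3, 4, 5]              # evenly spread values, light tails
heavy_tails = [0] * 8 + [-10, 10]   # most values at the centre, two extremes

print(kurtosis(flat, excess=False), kurtosis(flat))                 # below 3, below 0
print(kurtosis(heavy_tails, excess=False), kurtosis(heavy_tails))   # above 3, above 0
```

Read on the excess scale, the output above suggests that only Miles (4.321) has clearly heavy tails, while Fitness (-0.369) is slightly flatter than a normal distribution.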
In [352]:
cardio_fit_data.loc[cardio_fit_data['Education']>18]
Out[352]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
156 TM798 25 Male 20 Partnered 4 5 74701 170
157 TM798 26 Female 21 Single 4 3 69721 100
161 TM798 27 Male 21 Partnered 4 4 90886 100
175 TM798 40 Male 21 Single 6 5 83416 200
  • There are only 4 records with education level greater than or equal to 20.

Observations on Usage

In [353]:
histogram_boxplot(cardio_fit_data,'Usage')
In [354]:
sns.displot(data=cardio_fit_data,x='Usage',kind='kde')
plt.show()
  • The most common Usage value is 3 times per week, followed by 4.
  • Some customers also report 2, 5, 6 or 7 times per week.
  • The data has a slight tail to the right, but we can say that it is almost normally distributed.
In [355]:
histogram_boxplot(cardio_fit_data,'Fitness')
  • Fitness has a few outliers at the minimum value (1).
  • Fitness level 3 has the most customers, while levels 2, 4 and 5 have roughly similar counts.
  • Judging by the curve, the data is approximately normally distributed.

Observations on Income

In [356]:
histogram_boxplot(cardio_fit_data,'Income')
  • The distribution of the Income is skewed towards the right.
  • There are many outliers in this variable; the boxplot represents values above roughly 80,000 as outliers.
  • The values seem fine as they form a continuous range.

Observations on Miles

In [357]:
histogram_boxplot(cardio_fit_data,'Miles')
  • The distribution of the Miles is skewed towards the right.
  • There are many outliers in this variable; the boxplot represents values above roughly 188 as outliers.
  • The values seem fine as they are continuous, but there is a gap between the maximum and the second-highest value.
In [358]:
cardio_fit_data.loc[cardio_fit_data['Miles']>200].shape
Out[358]:
(6, 9)
In [359]:
# finding the gender of these customers
cardio_fit_data.loc[cardio_fit_data['Miles']>200,'Gender'].value_counts()
Out[359]:
Male      4
Female    2
Name: Gender, dtype: int64
  • There are only 6 such customers (4 Male & 2 Female) who expect to run more than 200 miles.

Let's explore the categorical variables now

In [360]:
sns.countplot(data=cardio_fit_data,x='Product');
In [361]:
sns.countplot(data=cardio_fit_data,x='Gender');
In [362]:
sns.countplot(data=cardio_fit_data,x='MaritalStatus');

From the above charts, it's easy to see which level of each categorical variable has more customers than the others.

Multivariate Data Analysis

  • Multivariate data analysis refers to all statistical methods that simultaneously analyze multiple measurements on each individual respondent or object under investigation.

  • This method is used principally for four reasons, i.e. to see patterns of data, to make clear comparisons, to discard unwanted information and to study multiple factors at once.

In [363]:
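# numeric_only=True (pandas >= 1.5) skips the text columns; newer pandas versions raise an error on them otherwise
cardio_fit_data.cov(numeric_only=True)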
Out[363]:
Age Education Usage Fitness Income Miles
Age 48.212 3.149 0.113 0.407 58844.463 13.187
Education 3.149 2.615 0.693 0.637 16704.718 25.771
Usage 0.113 0.693 1.177 0.695 9303.043 42.710
Fitness 0.407 0.637 0.695 0.919 8467.925 39.073
Income 58844.463 16704.718 9303.043 8467.925 272470624.145 465265.362
Miles 13.187 25.771 42.710 39.073 465265.362 2689.833
In [364]:
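# numeric_only=True (pandas >= 1.5) skips the text columns; newer pandas versions raise an error on them otherwise
cardio_fit_data.corr(numeric_only=True)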
Out[364]:
Age Education Usage Fitness Income Miles
Age 1.000 0.280 0.015 0.061 0.513 0.037
Education 0.280 1.000 0.395 0.411 0.626 0.307
Usage 0.015 0.395 1.000 0.669 0.520 0.759
Fitness 0.061 0.411 0.669 1.000 0.535 0.786
Income 0.513 0.626 0.520 0.535 1.000 0.543
Miles 0.037 0.307 0.759 0.786 0.543 1.000
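
The correlation matrix is the covariance matrix rescaled by the standard deviations: corr(X, Y) = cov(X, Y) / (std(X) * std(Y)). As a quick cross-check, the sketch below plugs in the covariances printed above and the standard deviations from the statistical summary (the helper name correlation is ours, for illustration only).

```python
def correlation(cov_xy, std_x, std_y):
    """Pearson correlation recovered from a covariance and two standard deviations."""
    return cov_xy / (std_x * std_y)

# cov(Age, Income) from the covariance matrix, std values from describe()
age_income = correlation(58844.463, 6.943, 16506.684)

# cov(Fitness, Miles) from the covariance matrix, std values from describe()
fitness_miles = correlation(39.073, 0.959, 51.864)

print(round(age_income, 3), round(fitness_miles, 3))
```

Up to rounding of the printed inputs, both values agree with the corresponding entries of the correlation matrix (0.513 and 0.786).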
In [365]:
# let's check the correlation between two variables - Age & Fitness
cardio_fit_data[['Age','Fitness']].corr()
Out[365]:
Age Fitness
Age 1.000 0.061
Fitness 0.061 1.000
  • A heatmap is a graphical representation of data as a color-encoded matrix. It is a great way of representing the correlation for each pair of columns in the data. The heatmap() function of seaborn helps us to create such a plot.
In [366]:
plt.figure(figsize=(10,5))
sns.heatmap(cardio_fit_data.corr(numeric_only=True),annot=True,cmap='Spectral',vmin=-1,vmax=1)
plt.show()

Observations

  • From the above chart it's clear that no pair of variables is negatively correlated.

  • Fitness is highly correlated with Usage and Miles, which makes sense.

  • Age is moderately related to Income and weakly related to Education, but it is essentially unrelated to Fitness, Usage & Miles.

  • A heatmap is very useful for comparing all the variables at a glance.

In [367]:
#num_var = ['Age','Education','Usage','Fitness','Income', 'Miles']
#sns.pairplot(data=cardio_fit_data[num_var], diag_kind="kde")
sns.pairplot(data=cardio_fit_data, kind="reg")
plt.show()

Observations

  • Pair plot is very helpful to see how each column/feature is distributed.

  • It also shows the best-fit regression line between each pair of variables.

  • Age, Income and Miles are right skewed and Education, Usage and Fitness are almost normally distributed.

In [368]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=cardio_fit_data,x='Product',y='Miles',hue='Gender')
plt.show()
In [369]:
sns.lineplot(data=cardio_fit_data, x='Usage', y ='Fitness',ci=None)
plt.show()
In [370]:
plt.figure(figsize=(15,7))
sns.lineplot(data=cardio_fit_data, x='Miles', y ='Fitness',ci=None)
plt.show()
In [371]:
sns.lmplot(y = 'Fitness', x = 'Miles', hue = 'MaritalStatus', data = cardio_fit_data);
  • We can also use a scatter plot, lineplot or lmplot instead of a pairplot if we want to check two specific variables alone.
  • A positive correlation or an increasing trend can be clearly observed between the Miles and Fitness.
  • The positive correlation indicates that customers who expect to run more Miles tend to rate their Fitness higher.
In [372]:
sns.barplot(data = cardio_fit_data, x= 'Product', y='Usage', hue ='Gender');
In [373]:
sns.barplot(data = cardio_fit_data, x= 'Product', y='Miles', hue ='MaritalStatus')
plt.xticks(rotation=90);
In [374]:
sns.barplot(data = cardio_fit_data, x= 'Product', y='Usage', hue ='Fitness');
In [375]:
sns.barplot(data = cardio_fit_data, x= 'Product', y='Income', hue ='Fitness')
plt.xticks(rotation=90);

The plots created in seaborn are very useful, but not interactive. To create interactive graphs, we can use plotly as below

In [376]:
# importing plotly
import plotly.express as px
In [412]:
#Creating a bar chart using plotly to show the Usage of each Product
fig = px.bar(cardio_fit_data, x="Product", y="Usage",  
                   title ="Usage of each Product",
                   width = 800, height = 400,
                   template="simple_white")
fig.show(renderer='notebook')
In [413]:
#fig = px.bar(cardio_fit_data, x="Product", y="Miles", color="Fitness", barmode="group", facet_col="MaritalStatus")
fig = px.bar(cardio_fit_data, x="Product", y="Miles", color="Fitness", facet_col="MaritalStatus")
fig.show(renderer='notebook')
In [414]:
fig = px.scatter(cardio_fit_data, x="Product", y="Miles", color="Gender", symbol="Fitness", facet_col="MaritalStatus")
fig.show(renderer='notebook')
In [415]:
fig = px.scatter(cardio_fit_data, x="Product", y="Miles", color="Gender", symbol="Usage", facet_col="MaritalStatus")
fig.show(renderer='notebook')

We can also separate the dataset as below in case we want to dive deeper into the data

In [381]:
Male = cardio_fit_data['Gender'] == 'Male'
maledata= cardio_fit_data[Male]
Female = cardio_fit_data['Gender'] == 'Female'
femaledata= cardio_fit_data[Female]
single = cardio_fit_data['MaritalStatus'] == 'Single'
singledata= cardio_fit_data[single]
partnered = cardio_fit_data['MaritalStatus'] == 'Partnered'
partnereddata= cardio_fit_data[partnered]
In [382]:
#creates count plot for the number of single male and female customers that bought the different products.
plt.figure(figsize=(10, 5)) #change figure size
sns.countplot(x='Gender',hue = 'Product', data= singledata)
plt.title("Number of single customers with respect to gender and product bought", fontsize = 16) #titles the graph.
plt.ylabel('Number of customers', fontsize = 12) #changes y axis label (a count plot shows counts on the y axis).
plt.xlabel('Gender', fontsize = 12) #changes x axis label.
plt.show() #shows graph.
In [383]:
#creates a stacked histogram of the male customers' ages, split by the product bought.
sns.displot(maledata, x='Age', hue = 'Product', multiple = 'stack')
plt.title("Distribution of male customers age with respect to product") #titles the graph.
plt.ylabel('Number of customers') #changes y axis label.
plt.xlabel('Age (years)') #changes x axis label.
plt.show() #shows graph.
  • sns.relplot() is used to visualize any statistical relationships between quantitative variables.
  • Why use relplot() instead of scatterplot() ?
    • relplot() is a figure-level function that lets you create multiple subplots (facets) in a single figure.
      • kind - specifies the kind of plot to draw (scatter or line)
      • ci - specifies the confidence interval
      • col_wrap - specifies the number of columns in the grid
In [384]:
sns.relplot(data=cardio_fit_data,x='Usage',y='Fitness',col='MaritalStatus',kind='line', ci=None, col_wrap=4)
plt.show()
# double click on the plot to zoom in
In [385]:
sns.relplot(data=cardio_fit_data,x='Usage',y='Fitness',col='Gender',kind='line', ci=None, col_wrap=4)
plt.show()
In [386]:
sns.relplot(data=cardio_fit_data,x='Income',y='Fitness',col='MaritalStatus',kind='line', ci=None, col_wrap=4)
plt.show()
# double click on the plot to zoom in
In [387]:
sns.catplot(x='Education', y='Fitness', data=cardio_fit_data, kind="bar", hue='Gender')
plt.show()
In [388]:
sns.catplot(x='Product', y='Education', data=cardio_fit_data, kind="bar", hue='Gender')
plt.show()
In [389]:
sns.catplot(x='Usage', y='Fitness', data=cardio_fit_data, kind="bar", hue='MaritalStatus')
plt.show()
In [390]:
sns.catplot(x="Fitness", data = cardio_fit_data, col="Gender", kind = "count");
In [391]:
plt.figure(figsize=(15,7))
palette = sns.color_palette("mako_r", 6)
sns.lineplot(data=cardio_fit_data, x="Miles", y="Income", hue='Fitness', style="Gender", palette="pastel", ci=False,)
plt.ylabel('Income of Customer')
plt.xlabel('Expected Miles to run')
plt.show()

The graphs below will help us understand the data in terms of the Products

In [392]:
plt.figure(figsize=(15,7))
sns.lineplot(data=cardio_fit_data, x="Miles", y="Usage", hue='Product', estimator='sum', ci=False)
plt.ylabel('Avg # of times per week')
plt.xlabel('Expected Miles to run')
plt.show()
In [393]:
plt.figure(figsize=(10,5))
sns.boxplot(data=cardio_fit_data,x='Product',y='Usage',showfliers=False) # turning off outliers
plt.xticks(rotation=90)
plt.show()
In [394]:
plt.figure(figsize=(10,5))
#sns.boxplot(data=cardio_fit_data,x='Product',y='Fitness',showfliers=False) # turning off outliers
sns.boxplot(data=cardio_fit_data,x='Product',y='Fitness')
plt.xticks(rotation=90)
plt.show()
In [395]:
plt.figure(figsize=(10,5))
sns.boxplot(data=cardio_fit_data,x='Product',y='Miles',showfliers=False) # turning off outliers
plt.xticks(rotation=90)
plt.show()
In [396]:
# Fitness measure for every product
sns.catplot(x='Fitness',
            col='Product', 
            data=cardio_fit_data,
            col_wrap=4,
            kind="violin")
plt.show()
In [397]:
# Usage measure for every product
sns.catplot(x='Usage',
            col='Product', 
            data=cardio_fit_data,
            col_wrap=4,
            kind="violin")
plt.show()

Creating bins for Age column

  • 18 - 29 Years - The Customer is in 18-30 Category.
  • 30 - 39 Years - The Customer is in 30+ Category.
  • 40 - 51 Years - The Customer is in 40+ Category.

We will use pd.cut() function to create the bins in Age column.

Syntax: pd.cut(x, bins, right=True, labels=None)

x - column/array to be binned
bins - number of bins to create, or a list of bin edges
labels - specifies the labels for the bins
right - whether each bin includes its right edge (default True); if set to False, each bin includes its left edge and excludes its right edge
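
The effect of right is easiest to see on values that sit exactly on a bin edge. The sketch below uses the same edges and labels as this notebook but a small hypothetical list of ages, and compares right=False with the default right=True.

```python
import pandas as pd

ages = pd.Series([18, 29, 30, 40, 51])  # hypothetical ages on and around the bin edges
edges = [18, 30, 40, 52]
labels = ["18-30", "30+", "40+"]

# right=False: bins are [18, 30), [30, 40), [40, 52)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

# default right=True: bins are (18, 30], (30, 40], (40, 52]
right_closed = pd.cut(ages, bins=edges, labels=labels)

print(pd.DataFrame({"Age": ages, "right=False": left_closed, "right=True": right_closed}))
```

With right=False an age of 30 falls into the 30+ bin and 18 is kept. With the default, 30 would stay in the first bin and 18 would fall outside every bin (NaN), unless include_lowest=True is passed.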
In [398]:
# using pd.cut() function to create bins
cardio_fit_data['Age_Category'] = pd.cut(cardio_fit_data['Age'],bins=[18,30,40,52],labels=['18-30','30+','40+'], right = False)
In [399]:
sns.histplot(data=cardio_fit_data,x='Age_Category',stat='density')
plt.show()
In [400]:
cardio_fit_data['Age_Category'].unique()
Out[400]:
['18-30', '30+', '40+']
Categories (3, object): ['18-30' < '30+' < '40+']
In [401]:
cardio_fit_data.head(10)
Out[401]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles Age_Category
0 TM195 18 Male 14 Single 3 4 29562 112 18-30
1 TM195 19 Male 15 Single 2 3 31836 75 18-30
2 TM195 19 Female 14 Partnered 4 3 30699 66 18-30
3 TM195 19 Male 12 Single 3 3 32973 85 18-30
4 TM195 20 Male 13 Partnered 4 2 35247 47 18-30
5 TM195 20 Female 14 Partnered 3 3 32973 66 18-30
6 TM195 21 Female 14 Partnered 3 3 35247 75 18-30
7 TM195 21 Male 13 Single 3 3 32973 85 18-30
8 TM195 21 Male 15 Single 5 4 35247 141 18-30
9 TM195 21 Female 15 Partnered 2 3 37521 85 18-30
In [402]:
cardio_fit_data.tail(10)
Out[402]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles Age_Category
170 TM798 31 Male 16 Partnered 6 5 89641 260 30+
171 TM798 33 Female 18 Partnered 4 5 95866 200 30+
172 TM798 34 Male 16 Single 5 5 92131 150 30+
173 TM798 35 Male 16 Partnered 4 5 92131 360 30+
174 TM798 38 Male 18 Partnered 5 5 104581 150 30+
175 TM798 40 Male 21 Single 6 5 83416 200 40+
176 TM798 42 Male 18 Single 5 4 89641 200 40+
177 TM798 45 Male 16 Single 5 5 90886 160 40+
178 TM798 47 Male 18 Partnered 4 5 104581 120 40+
179 TM798 48 Male 18 Partnered 4 5 95508 180 40+

Outlier Detection and Treatment

  • An outlier is a data point that is abnormally/unrealistically distant from other points in the data.

  • The challenge with outlier detection is determining if a point is truly a problem or simply a large value. If a point is genuine then it is very important to keep it in the data as otherwise we're removing the most interesting data points.

  • It is left to the best judgement of the investigator to decide whether treating outliers is necessary and how to go about it. Domain Knowledge and impact of the business problem tend to drive this decision.

Handling outliers

Some of the common methods to deal with the data points that we actually flag as outliers are:

  • Replacement with null values - We can consider these data points as missing data and replace the abnormal values with NaNs.
  • IQR method - Replace the data points with the lower whisker (Q1 - 1.5 IQR) or upper whisker (Q3 + 1.5 IQR) value.
  • We can also drop these observations, but we might end up with losing other relevant observations as well.

So, it is often a good idea to examine the results by running an analysis with and without outliers.
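
The three options listed above can be sketched side by side on a small series. The values below are hypothetical (one deliberately extreme point), and the variable names are ours. Each option produces a new object, so the original series stays available for the with/without comparison.

```python
import pandas as pd

miles = pd.Series([47, 66, 75, 85, 94, 112, 360], name="Miles")  # hypothetical values

# Whiskers from the interquartile range
q1, q3 = miles.quantile(0.25), miles.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (miles < lower) | (miles > upper)

as_missing = miles.where(~is_outlier)          # option 1: flagged values become NaN
capped = miles.clip(lower=lower, upper=upper)  # option 2: flagged values are moved to the whisker
dropped = miles[~is_outlier]                   # option 3: flagged rows are removed

print(pd.DataFrame({"original": miles, "as_missing": as_missing, "capped": capped}))
print("Rows after dropping:", len(dropped))
```

Capping keeps the row count unchanged, which matters when other columns of the same row are still useful.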

Visualization of all the outliers present in data

In [403]:
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list 
numeric_columns = ['Age','Education','Usage','Fitness','Income', 'Miles']

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 3, i + 1)
    plt.boxplot(cardio_fit_data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Let's find the percentage of outliers, in each column of the data, using IQR.

Treating outliers

We will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.

Note: Generally, a value of 1.5 * IQR is taken to cap the values of outliers to upper and lower whiskers but any number (example 0.5, 2, 3, etc) other than 1.5 can be chosen. The value depends upon the business problem statement.

In [404]:
# Finding the 25th percentile and 75th percentile for the numerical columns.
Q1 = cardio_fit_data[numeric_columns].quantile(0.25)
Q3 = cardio_fit_data[numeric_columns].quantile(0.75)

IQR = Q3 - Q1                   #Interquartile range (75th percentile - 25th percentile)

lower_whisker = Q1 - 1.5*IQR    #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
In [405]:
# Percentage of outliers in each column
((cardio_fit_data[numeric_columns] < lower_whisker) | (cardio_fit_data[numeric_columns] > upper_whisker)).sum()/cardio_fit_data.shape[0]*100
Out[405]:
Age          2.778
Education    2.222
Usage        5.000
Fitness      1.111
Income      10.556
Miles        7.222
dtype: float64
In [406]:
Q1 = cardio_fit_data['Miles'].quantile(0.25)  # 25th percentile
Q3 = cardio_fit_data['Miles'].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1                # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
-7.125
187.875
In [407]:
cardio_fit_data.loc[cardio_fit_data['Miles']>188].sort_values('Miles',ascending=False)
Out[407]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles Age_Category
173 TM798 35 Male 16 Partnered 4 5 92131 360 30+
166 TM798 29 Male 14 Partnered 7 5 85906 300 18-30
167 TM798 30 Female 16 Partnered 6 5 90886 280 30+
170 TM798 31 Male 16 Partnered 6 5 89641 260 30+
155 TM798 25 Male 18 Partnered 6 5 75946 240 18-30
84 TM498 21 Female 14 Partnered 5 4 34110 212 18-30
142 TM798 22 Male 18 Single 4 5 48556 200 18-30
148 TM798 24 Female 16 Single 5 5 52291 200 18-30
152 TM798 25 Female 18 Partnered 5 5 61006 200 18-30
171 TM798 33 Female 18 Partnered 4 5 95866 200 30+
175 TM798 40 Male 21 Single 6 5 83416 200 40+
176 TM798 42 Male 18 Single 5 4 89641 200 40+
In [408]:
#Treating outliers
#cardio_fit_data['Miles'] = np.clip(cardio_fit_data['Miles'], lower_whisker, upper_whisker)
#sns.boxplot(data=cardio_fit_data,x='Miles')
#plt.show()
In [409]:
Q1 = cardio_fit_data['Income'].quantile(0.25)  # 25th percentile
Q3 = cardio_fit_data['Income'].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1                # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
22144.875
80581.875
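
The whisker values printed for Miles and Income can be reproduced from the 25% and 75% columns of the statistical summary alone. A small worked check (the helper whiskers and its parameter k are ours, for illustration; k is the multiplier discussed in the note above):

```python
def whiskers(q1, q3, k=1.5):
    """Return (lower, upper) whiskers: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the describe() output
miles_low, miles_high = whiskers(66.0, 114.75)
income_low, income_high = whiskers(44058.75, 58668.0)

print(miles_low, miles_high)
print(income_low, income_high)

# A wider multiplier flags fewer points as outliers
print(whiskers(66.0, 114.75, k=3))
```

The negative lower whisker for Miles simply means that no value can be flagged as a low outlier, since Miles cannot be below 0.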
In [410]:
cardio_fit_data.loc[cardio_fit_data['Income']>80581].sort_values('Income',ascending=False)
Out[410]:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles Age_Category
174 TM798 38 Male 18 Partnered 5 5 104581 150 30+
178 TM798 47 Male 18 Partnered 4 5 104581 120 40+
168 TM798 30 Male 18 Partnered 5 4 103336 160 30+
169 TM798 30 Male 18 Partnered 5 5 99601 150 30+
171 TM798 33 Female 18 Partnered 4 5 95866 200 30+
179 TM798 48 Male 18 Partnered 4 5 95508 180 40+
162 TM798 28 Female 18 Partnered 6 5 92131 180 18-30
172 TM798 34 Male 16 Single 5 5 92131 150 30+
173 TM798 35 Male 16 Partnered 4 5 92131 360 30+
161 TM798 27 Male 21 Partnered 4 4 90886 100 18-30
177 TM798 45 Male 16 Single 5 5 90886 160 40+
167 TM798 30 Female 16 Partnered 6 5 90886 280 30+
176 TM798 42 Male 18 Single 5 4 89641 200 40+
170 TM798 31 Male 16 Partnered 6 5 89641 260 30+
160 TM798 27 Male 18 Single 4 3 88396 100 18-30
164 TM798 28 Male 18 Single 6 5 88396 150 18-30
166 TM798 29 Male 14 Partnered 7 5 85906 300 18-30
175 TM798 40 Male 21 Single 6 5 83416 200 40+
159 TM798 27 Male 16 Partnered 4 5 83416 160 18-30
In [411]:
#Treating outliers
#cardio_fit_data['Income'] = np.clip(cardio_fit_data['Income'], lower_whisker, upper_whisker)
#sns.boxplot(data=cardio_fit_data,x='Income')
#plt.show()

As the outliers in this dataset look valid and continuous, we don't necessarily need to treat them. Removing genuine values would also distort the analysis. The cells above show how outliers could be capped in case we find any that need treatment.

Actionable Insights and Recommendations

Insights

We analyzed a dataset of three different treadmill products and the customers who bought them. The products were purchased by customers aged between 18 and 50, both single and partnered. The main features of interest here are the Miles and the self-rated Fitness values. It's good to see data related to health and fitness, which is very much necessary these days.

We have been able to conclude that -

  1. The product TM195 was purchased by the largest number of customers.
  2. Customers who bought TM798 expect to run more Miles than customers of the other two products, especially the partnered ones.
  3. Fitness ranges: TM195 1 to 5, TM498 1 to 4, TM798 3 to 5. From this we can say that the customers who are more into fitness buy the third product, TM798.
  4. The income of customers who bought TM798 is higher than for the other two products. So this product may be somewhat costlier than the other two, perhaps with some extra features.
  5. Products TM195 and TM798 were purchased more by male customers and TM498 more by female customers.
  6. Fitness varies with education level. Customers with more years of education rate themselves slightly fitter than others.
  7. Fitness is highly correlated with Usage and Miles which makes sense.
  8. Age, Income and Miles are right skewed and Education, Usage and Fitness are almost normally distributed.

Recommendations to business

  1. All age groups should be targeted. Customers in the 40+ bin buy the least, followed by the 30+ bin, yet this is the age at which customers need to focus more on health. Encouraging fitness across all ages is essential to grow the business.
  2. More customer information, such as email ID and phone number, should be collected. This will help the business communicate product information and offers.
  3. To reach customers of all education levels, the products should be advertised in schools, universities and offices.
  4. Usage among TM498 customers is relatively low compared to the other two products. The store should analyze whether this has something to do with the product's features or quality.
  5. Similarly, TM195 customers rate their Fitness in the average range even though their usage is at a good level. So it's better to check the product, and feedback can be collected from customers on this issue.

Further Analysis that can be done

  1. The business can apply machine learning to the dataset to make predictions, for example about which product a new customer is likely to buy.
  2. Outliers can be analyzed further to check whether the data is real or not.

Make your body into your most beautiful outfit! Happy Sweating with TreadMills!! Fitness Matters!!!